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Abstract 

Let 1Z be a ring containing a principal N = (2 fe ) th root of unity. We 
present an algorithm which, given a polynomial f(z) over 1Z of degree 
less than n < N, gives a vector of n weighted evaluations of f(z), using 
T^nlogn + 0(n) ring multiplications. This algorithm requires only the 
space for the input itself and additional 0(1) ring elements and bounded- 
precision integers. The algorithm uses a linear-time method of breaking 
f(z) into reduced images modulo polynomials of the form z m + 1, m a 
power of two, then uses the Fast Fourier Transform (FFT) to get a vector 
of evaluation points of each image. The result is an in-place truncated 
Fourier transform with complexity comparable to the truncated Fourier 
transform of Van der Hoeven. Using this algorithm we give an in-place 
algorithm for polynomial multiplication requiring space for only the inputs 
and output and an additional 0(1) bounded-precision integers and ring 
elements. 



1 Introduction 

The Fast Fourier Transform (FFT) is an algorithm which maps a polynomial 
f(z) over a ring 72. to a vector comprised of n evaluations of f(z) using 0(n log n) 
arithmetic ring operations. This algorithm lends itself to fast integer and poly- 
nomial multiplication. The radix-2 Cooley-Tukey FFT |T| requires that the in- 
put have a power-of-two size. This results in jumps in space and time complexity 
at power-of-two sized inputs. Truncated Fourier Transform (TFT) algorithms 
by Van der Hoeven ^6j and by Harvey and Roche [J] have addressed these jumps 
in time and space complexity, respectively. 

In this paper we present a new in-place TFT algorithm that improves on the 
complexity of the Harvey- Roche TFT, in terms of the number of ring multipli- 
cations, by a constant factor. By in-place, we mean that the algorithm, acting 
on a size-rt input, writes its result in place of its input and only requires space 
for the output itself plus an additional 0(1) ring elements and integers bounded 
by an for a fixed constant c. We then show how this algorithm may be used 
towards polynomial multiplication. 

In this section we describe the FFT and the previous TFT algorithms. 
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1.1 The discrete Fourier transform and its inverse 



The discrete Fourier transform (DFT) of a polynomial f(z) is its vector of 
evaluations at the distinct powers of a root of unity. Specifically, if f(z) — 
Y^i=o aiZ% an d ^ contains an n th primitive root of unity u, then we define the 
discrete Fourier transform of f(z) as 

DFT™(/) = (/M) < j<n . (1) 

We treat the vector a = (ao, ai, . . . , a n —i) an d / as equivalent and use DFT™ (a) 
and DFT™ (/) interchangeably. The map DFT™ : Tl[z\/(z n - 1) ->• H n forms an 
additive group isomorphism. If uj is a principal root of unity, that is, for j not 
divisible by n, Y^i=o cjU = ^> then we can obtain / mod (z™ — 1) by way of its 
inverse map, IDFT" : K n -)• K[z]/(z n - 1), defined by 

iDFT",(a ,...,a„) = ±BFT u -i (a , . . . , a n ), (2) 

where we again we treat a polynomial as equivalent to its vector of coefficients. 
The DFT gives a polynomial multiplication algorithm: 

Theorem 1 (The Convolution Theorem). 

fg mod (z™ - 1) - IDFT" (DFT W (/) • DFT W ( 5 )) , (3) 

where "■ " is the vector dot-product, and, given two polynomial g(z), h(z) € "R\z\, 
g{z) mod h(z) denotes the unique polynomial r(z) such that h(z) divides g{z) — 
r{z) and deg(r) < deg(ft.) throughout. 



1.2 The Fast Fourier Transform 

We can compute the discrete Fourier transform of f(z), f reduced modulo (z™ — 
1), by way of the Fast Fourier Transform (FFT), which recursively reduces a 
DFT into smaller DFTs based on the factorization of n. The radix-2 FFT of 
Cooley and Tukey [I], computes a DFT for n a power-of-two by a divide-and- 
conquer approach. We break f(z) into two polynomials fo(z) and fi(z), where 

n/2-1 n/2-1 

/oO) = a2kzk > h( z ) = Yl a ^+\z k - (4) 

k=0 k=Q 

Note that f{z) = f (z 2 ) + zf 1 (z 2 ). Thus, as c^+™/ 2 = -u>i, and uj 2 ^ = uj 2 U+ n / 2 ) 
for an n th root of unity lu, we have that 

f(w*) = f (u 2j ) + uPhiuF), f(^ +n/2 ) = fo(oJ 2 i) - uSfxiojV), (5) 

for < j < n/2. The pair of computations (JS|) are known as a butterfly op- 
eration. We perform these butterfly operations in place. Note that the values 
/i(w 2j ) comprise DFT w 2(/ i ) for i = 0, 1, two DFTs of size n/2. Thus, to com- 
pute the discrete Fourier transform of /, we split / into polynomials /o and 
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fx, compute their size-n/2 DFTs with root-of- unity ui 2 , and then compute the 
butterfly operations ([5]), for < j < n, to obtain the evaluations of / at the 
powers of w. 

If the butterfly operations are performed in place, the resulting evaluations 
f{uj J ) will be written in bit-reversed order. More precisely, if we let rev c (j) 
denote the integer resulting from reversing the first c bits of j, we have that 
f{uj J ) will be written in place of a^, where k = revi og („) (J) and log is taken to 
be base-2 throughout. As an example, 

rev 5 (ll) = rev 5 (01011 2 ) = 11010 2 = 16 + 8 + 2 = 26. (6) 



Procedure FFT(a, n, w), an in-place implementation of the radix-2 

Cooley-Tukey FFT 

Input: 

• a = (a(0), . . . , a(n — 1)), an array containing a vector a G lZ n . 

• u>, a root of z" — 1, for n a power of two. 

Result: DFT™(a) is written to the array a in bit-reversed order. 

l for i i — 1 to log 2 (n) do 
m i — n/2 l 

II Complete m DFTs of size n/m 

3 ( 7 ,C)^(l,w m ) 

4 for J^Otoi-ldo 

5 I i — revi(j)m 

6 for k < — to m — 1 do 

7 (h,h)< — (l + k,l + k + m) 
II Butterfly operations ([5]) 

a(ii),a(f 2 )) ^— (a(f 1 )+7a(i 2 ),a(f 1 )-7a(i 2 
7 < — 7C 



Procedures IFFTI and HFFTl describe implementations of the in-place FFT and 
its inverse, made non-recursive so as to make them in-place. Normally one 
would pre-compute powers of w and store them in an array in bit-reversed 
order. To avoid this space overhead we perform the butterfly operations in an 
order such that we can compute the required powers of w m sequentially. For 
better memory performance, one can produce the powers of uj in an order that 
allows us to traverse the array in sequential order, at the cost of an additional 
0(n) ring multiplications. 

The elements 7, ui, C, and array a comprise all the ring elements in the al- 
gorithm. Our only ring operations are due to the butterfly operations (line [S]) 
and computing the powers of u; (lines [J2 ©. w m can be computed by a square- 
and-multiply approach using 0(log m) multiplications. Enumerating these op- 
erations gives us the following cost: 



3 



Procedure IFFT(a, n, u>) , an in place implementation of the inverse-FFT 

Input: a = (a(0), . . . , a(n — 1)), an array containing DFT"(a), for a 

vector a £ 7l n , for n a power of two, in bit-reversed order. 
Result: The vector a is written to array a in place of its DFT. 

1 for i < — 1 to log 2 (n) do 
to i — 2 i ~ l 

// Complete ^ DFTs of length 2m with root of unity lo^ 1 
( 7 ,C) <— (l,u;-™) 
for j i — to — 1 do 
for k < — to m do 

I i — 2km 

(h,h) <— (j + lj + l + m) 

II Butterfly operations (f5]) 

a(h),a(l2)) <— fa(f 1 )+7a(i 2 ) J a(f 1 )-7a(i 2 ; 
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7C 



10 for 



to n — 1 do a(i) 



±a(i) 



Lemma 2. Lei n be a power of two and let a 6e an array containing the co- 
efficients of a polynomial f{z) reduced modulo (z n — 1). Computing DFT™ (a) 
by way of procedure IFFTl requires n log n + 0(1) additions and log n + O(n) 
multiplications in 1Z. Procedure \IFFT\ has similar complexity. 

The approach described for n a power of two generalizes for n a prime power 
or a product of small primes, though in practise n is almost always chosen to 
be a power-of-2. 

Using the radix-2 FFT, if d = deg(fg), we choose n to be the least power of 
2 exceeding d. This entails appending zeros to the input arrays a and b. By this 
method, computing a product of degree 2 fc requires the same time and space 
cost as computing a product of degree 2 fe + to, for < m < 2 k . It would be 
preferable to have a radix-2 FFT algorithm whose time cost grows smoothly 
relative to n, and which requires n + 0(1) space for arbitrary integers n > 0, 
and not just for n a power of two. Crandall's devil's convolution algorithm |2J 
somewhat flattens these jumps in complexity, though not entirely. It works by 
reducing a discrete convolution of arbitrary length into more easily computable 
convolutions. More recently, truncated Fourier transform (TFT) algorithms, 
described hereafter, have addressed this issue. 

1.3 Truncated Fourier Transform algorithms 

The Truncated Fourier Transform (TFT) algorithm due to Van der Hoeven [6] 
returns a pruned DFT such that the result contains as few evaluations as is 
necessary. Suppose deg(fg) < n and N = 2 k where k = [log 2 n\ . If u; is an N th 
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primitive root of unity, we define the TFT of f(z) with parameters u> and n to 
be the vector of evaluation points 

TFT" (/) = (/K cv "(°>), /K CVfc(1) ), . . . , /K ^"- 1 ')) . (7) 

It comprises the first n terms of DFT"(/), written in bit- reversed order. The 
TFT algorithm works by discarding unnecessary butterfly operations ([3]) where 
at least one of its inputs are 0. The inverse transform is more involved. Given 
a length- N vector a, the inverse TFT takes TFT™ ((ao, . . . , a„_i)) and the non- 
transformed vector coefficients (a„ + i, . . . , ajv-i) and returns the entire vector 
a. For our purposes a is the vector of coefficients of / and ai is presumed to be 
zero for I > n. The algorithm relies on the fact that one can obtain any two 
values in a butterfly relation ([5]) given the other two values by solving a 2 x 2 
linear system. This method still requires space for N + 0(1) ring elements. 

Theorem 3 (Van der Hoeven). The TFT algorithm computes TFT™(/) using 
n [log n] +0(n) ring additions and ^(n\\ogn~\) + 0(n) multiplications by powers 
of us. The inverse TFT algorithm requires similarly many multiplications by 
powers ofui, and n [log n~\ + 0(n) shifted ring additions. 

A shifted addition is merely the combined operation of an addition plus, 
possibly, a multiplication by 2 . For floating-point numbers and x G Z/pZ, 
multiplication by 2 , using bit shifts and bit operations, has bit-complexity 
comparable to addition. We note that Van der Hoeven computed more precise 
bounds on the number of ring operations (i.e. without "big-Oh" notation). We 
chose to write the cost as is to simplify the expressions in terms of n. Van der 
Hoeven's algorithm also generalizes to allow us to compute arbitrary subsets of 
DFTs. We define, for a set I C {0, 1, . . . , N - 1}, 

tft£(/) = (/K OVfcW )) . (8) 

By this notation TFT™ = TFT^, for / = {0, 1, . . . , n - 1}. 

More recently, Harvey and Roche [4] developed an in-place variant of the 
TFT which writes TFT™ in place of (ai)o<i<n- This algorithm was used to- 
wards fast in-place polynomial multiplication. This algorithm is implemented 
iteratively to avoid using an additional O(logrt) space for the stack, but the 
algorithm effectively has a recursive structure. We first recursively compute 

TFT r«/21 (/q) = /o(w a»v*(i)) ) o < i < [n/2], (9) 

in place of the subset of coefficients of /, (ao, 02, ... , CLf n /2~\ )■ Then, if n is odd, 
we compute 7 = fi(ui 21 ), where I = reVfe([n/2]). Such an evaluation takes 0(n) 
ring operations. We then compute 

TFT Ln/2j (A) = /i(w2ra v fcW); < • < [n/2ji (1Q) 

in place of the remaining coefficients of fx, [ax, 03, . . . , o,\ n /2\))- These two 
smaller TFTs, and 7 if n is odd, gives us the information requried to complete 
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the butterfly relations needed to compute TFT™(/). These additional linear- 
time evaluations make the Harvey-Roche truncated Fourier transform require, 
at worst, a constant factor additional ring operations. 

Theorem 4 (Roche [5], Chapter 3). The in-place truncated Fourier transform 
requires at most |n[logn] + ring multiplications to compute TFT™(/). 

1.4 The Half-DFT 

In the TFT algorithm of this paper we will use the half-DFT (HDFT). For uj, 
a root of z 11 + 1, the half-DFT is the map defined by 

HDFT™(/) = (/(uj 1 ), f(uj 3 ), . . . , /(c 2 "- 1 )) . (11) 

It is the evaluation of / at the roots of z n + 1. The half-DFT can be computed 
by way of a DFT [SJ. Namely, if we let a' be the vector defined by a'j = ajUji 
for < j < n, then, up to re-ordering, 

HDFT™ (a) = DFT" 2 (a'). (12) 

Procedure IHalf DFTl describes an implementation of this approach. 

Procedure Half DFT (a, n, uj) , an in-place algorithm for computing the 

half-DFT 

Input: 

• a = (a(0), . . . , a(n — 1)), an array containing a vector a G lZ n . 

• w, a root of z n + 1, where n is a power of two. 
Result: The half-DFT HDFT™ (a) is written in place of a 

1 7 < — 1 

2 for i i — to n — 1 do 

3 a(i) < — 7a(i) 

4 7 < — 7cj 

5 [FFTT a. n, w 2 ) 

Procedure InverseHalf DFT (a, n, uj) , an in-place algorithm for inverting 

the half-DFT 

Input: 

• a = (a(0), . . . , a(n — 1)), an array containing Half DFT™(a) for some 
a eft". 

• uj, a root of z n + 1, where n is a power of two. 
Result: a is written in place of HDFT" (a) 

1 HFFTf a. n, uj 2 ) 

2 7 < — 1 

3 for i i — to n — 1 do 

4 a(i) < — 7a(i) 

5 7 •< — 7cj _1 



Inverting the half-DFT is straightforward: we apply an inverse FFT and 
then multiply the i th coefficient by uj~ 1 . We have the following complexities. 
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Lemma 5. Letn be a vomer of two. [Half DFff a. n, uj) awrf llnverseHalf DFlf a, n, oj) 

both require 77 log n + 0(1) additions, and ^n\og 2 n + 0(n) multiplications. 

The cost of lHalf DFlT a. n, 7) is merely the cost of an FFT plus an additional 
2n ring multiplications. 



2 An in-place TFT algorithm 

2.1 Description of the result of the algorithm and notation 

We use the following notation throughout section [2] and thereafter. Suppose 
now that we have a polynomial f(z) = ^2j = q OjZ^ G of degree at most 

77. — 1 , where n can be any positive integer. Write n as n — J2i=i n ii where 
ni = 2™W, 77(7) € Z>o and 77^ > 77j for 1 < 7 < j < s. For 1 < i < s, we let 

i 

®i = z ni + l and ri = JJ$i. (13) 

Note that $i mod <Fj = 2 and 1^ mod $ 3 = 2 l for j > i. Our transform 
algorithm will compute the evaluations of f(z) at the roots of $i. Namely, if we 
fix a canonical root u) — lo\ of $1, and then let, for 2 < i < s, w,; = a;"^'" , a 
root of $i , we will compute the vector of evaluations 

/(wf) for 0<j<77i, 1<?:<s. (14) 

If we let 

I(n) = {0 < k < 77 : for some i,l < i < s,ni < k < 2ni}, (15) 

then the set of evaluations we compute is exactly TFT^™^/). 

The roots of unity used as evaluation points in TFT^ "-*(/) will differ from 
those used in TFT™(/). For instance, TFT™ will always use all roots-of- unity 
of order dividing 771, whereas TFT^™-* will instead use all roots of unity of order 

2m. 

We define, for 1 < i < s, the images 

f i = fmod$ i and ./V 2 '•'/,. (16) 

We call the polynomials /* the weighted images of /. We will first compute /* 
as they are less costly to compute than the images fa. Once we have all the 
weighted images /* we will reweigh them to obtain the unweighted images fa. 
We also define, for 1 < 7 < s, the images 

d = f mod r,. (17) 

We call the polynomials Ci the combined images of /. We also use 

/ ifi = , . 

/ quo r, ; if 1 < i < s v ' 
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the quotient produced by dividing / by I\ and 

n* = n mod rij. (19) 

It is straightforward to obtain q i: given /. Note that the degrees of any two 
distinct terms of I\ differ by at least rii, and that deg(qi) < rij. Thus, as I\ is 
monic, we have that the coefficients of qi merely comprise the coefficients of the 
higher-degree terms of /. More precisely, 

q { = a n _i2 (n * _1) + a„_ 2 z ( ™^ 2) H h a„_„»+iz + a n _ n j. (20) 

By a similar argument, we also have that, for 1 < i < s, 

q% = qi-i quo (21) 

We note that q s = 0. We will also use the remainder resulting from dividing q^ 
by $i, 

r t = g t _i mod $„ 1 < i < s. (22) 

Our transform has three steps: we first iteratively break / in-place into the 
remainders rf, we then iteratively write /* in place of for i = 1,2, ...,s; 
lastly we reweigh the weighted images /* to get fa and perform a half-DFT on 
each image ft separately to give us the weighted evaluation points. 

2.2 Breaking / into the sequence of remainders r 1; . . . ,r s 

We first break / = q into its quotient and remainder dividing by z ni + 1, 

711—1 

n = / mod (z" 1 + 1) = K - Oi+m ) z\ (23) 

i=0 

Ql = f quo (z" 1 a <+^ z *> ( 24 ) 

i=0 

where a, = for i > n. This can be done in place with ri\ subtractions in 1Z. 
We then similarly break q\ into r 2 and q-i , then g 2 into r 3 and (73, and continue 
until we have n, . . . , r s _i an d ? s -i- Since deg(g s _i) < n s = dcg($ s ), r s is 
exactly q s -i- 

This process is completely reversible. From = mod $i and <7i = 
<7i_i quo $i we can easily obtain qi-\. If we write ri = ^2™^ hz 1 and = 

Z)"=o Ci^*. thcn 

ft-i =(60 + co) + (&i + ci)z + • • • + (&„._! + cj-ijW - ^ (25) 
&„.«"«* + ■■■ + 6„ ( -i« n '- 1 + 
c z" s H hc„./-*-i. 

As r s = q s _i mod <f> s = g a _i, we can obtain g s _i from r s _i and r s . We then 
can obtain qj from and rj + \ until we have qo = f. 



8 



2.3 Writing the weighted images /* in place of the remain- 
ders Ti 

We first note that /* is precisely n . We will iteratively produce the remaining 
weighted images. 

Suppose, at the start of the i th iteration, we have f\, - ■ ■ ,f*, and rj+i, . . . ,r s , 
and we want to write f* +1 in place of r i+i . We have 

/; +1 = 2-7niod* <+1) (26) 

= 2- 4 (r^ + a) mod* i+ i, (27) 

= (<fc + 2- i C < ) mod$i+i, (28) 

= (r m + (2- J C jm od$ J+1 .)) (29) 

Unfortunately, we don't have the combined image d , but rather the weighted 
images /*, 1 < j < i, from which we can reconstruct d- We would like to be 
able to compute d mod in place from the weighted images /*. To that 
end, the following lemma gives us a basis for d mod 3>i+i. 

Given an integer e, we will let e[i] refer to the z th bit of e, i.e. 

Llog(e)J 

e= ^ eH2 J , e[i]e{0,l}. (30) 

Lemma 6. Fix i, j and k, l<j<i<k<s. Suppose that f* = z e and 
fl = for I j , < I < i. Then d mod $^ is nonzero only if e{n{l)\ = 1 for 
j < I < i, in which case 

d mod $ fe = 2 t - 1 z e mod $ fc . (31) 

Given that z nk mod $fe = — 1, we have that 

2 t - 1 z e mod $ fe = (-l) e Kfc)] 2 *- V e mod " fc) , (32) 

where e mod is the integer e* such that rife|(e — e*) and < e* < n^. The 
values e[n(l)} can be determined from ni — 2 n ^> and e by way of a bitwise AND 
operation. 

Example 7. Suppose n = 86 = 64 + 16 + 4 + 2, and suppose that 

fi = f mod z 64 + 1 = z e , /| = 0, / 3 * = 0, 

and deg(/) < 64+16 + 4. Then one can check that C3 mod (z 2 + 1), where 
d = f rem [(z 64 + l)(z 16 + l)(z 4 + 1)] , satisfies 



d mod (z 2 + 1) = < 



' ifee[0, 20) U [24, 28) U [32, 52) U [56, 60), 

4 ife = 20,28,52, or 56, 

4z i/e = 21,29,53, or 57, 

-4 i/e = 22,30,54, or 58, 

-Az ife= 23,31,55, or 59. 
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Our lemma is stated more generally than what is needed for the algorithm, 
as we only need a result for Cj mod for 1 < i < s. However, in doing so 
this allows us to prove the lemma inductively. 

Proof of lemma [6j We fix e and j, and prove by induction on i. 

Base case: First consider the base case i = j, in which case the non-zero 
criterion of the lemma always holds and we need only show (|3"TT) . We have that 

d = 2*- 1 /* = 2 i ~ 1 z e (mod $ t ), (33) 

and, by Chinese remaindering, 

d = d-i + IVi (rr\ (2'-V - d-x) mod $,) . (34) 

Since / mod = for 1 < I < i, it follows that Cj_i = 0, and 

d = ^i-\z e modr 4 . (35) 

As e < rij, Ti_iz e is already necessarily reduced modulo Ti, and we have d is 
exactly Ti^\z e . Reducing this modulo we have 

d mod $ fe = 2 i ~V mod $ fe (36) 

as desired. 

Inductive step: Suppose now that the lemma holds for a fixed i > j, and 
consider d+i mod > i + 1. We suppose that f t * = for 1 < I ^ j < i + 1 

and /* = z e . As / mod <I>i+i = 0, Chinese remaindering gives us 

d+i =d- Ti (Tr^d mod . (37) 

We prove the inductive step by cases. 

Case 1: If e[n(Z)] = for some I, j < I < i, then by the induction hypothesis, 
d mod = 0. Thus, given ([57]). Cj+i = Cj and C l+1 mod $ fe = d mod 
By the induction hypothesis again, Cj mod $fe = 0, completing the proof for 
this case. 

Case 2: If e[n(7)] = 1 for j < I < i, then by the induction hypothesis and 

C, mod = 2 l ~V mod = 2 i - 1 (-l) e W i +i)] z (e mod (38) 
applying this to (|3"7) gives 

C l+1 = Ci-Ti {(-iyMi+D] z (emodn i+1 )\ mod r . +1 . (39) 

By inspection of the degrees of the polynomials appearing in (|39p . we have 
equality: 

C l+l = d - Ti ^(_l)«K*+i)]i-(« mod . (40) 
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Reducing this modulo $^ and applying the induction hypothesis to Ci mod 
gives us 

Ci+i mod $ fe = 2 i ~V - 2^ 1 (-l) e [ n ( t+1 ^z( emod " l + 1 ) mod $ fe (41) 
= 2 1 - 1 (l - (-1)^+^ z (e mod mod $ fc , (42) 

as z e ee z emodn<+i ( mod $fc ) Since x _ (_ 1 )e[n( l +i)] eva i uates to 2 if e [n(i + 
1)] = 1 and otherwise, this completes the proof. □ 

Lemma [6] tells us the contribution of /* towards subsequent images. If e 
satisfies the non-zero criterion of the lemma, then by (|29p . a term Cj. e z e of /*, 
j < i will contribute i(_i)e[n(i+i)] z (e modnt+i) to In ordcr to makc the 

contributions have weight ±1, we instead first reweigh Vi by 2 and compute 2/*, 
and then divide by 2 thereafter. lAddContributionsI adds the contribution of 
/*,..., f*_ l to /*. It also can subtract these contributions, for the purposes of 
the inverse TFT algorithm. 

According to lemma [51 fj,0<j<i, only a proportion of 2 l ~i of the terms 
of f* will have a non-zero contribution to f* +1 . Thus the total cost of adding 
contributions of /* towards /*, i > j, is less than 2#/* = 2nj. It follows that 
the total additions and subtractions in 1Z required to add all these contributions 
is bounded by 2n. Since lAddContributionsI only scales array ring elements by 
±1, we have the following complexity: 

Lemma 8. Let n — ^| =1 Tii. Canrnq lAddContributionsf a. n, i, add?) for 1 < 

i < s entails no more than 2n ring additions and no ring multiplications. 

In the manner we have chosen to add these contributions, we will have to 
make s— 1 passes through our array to add them all. One way we could avoid this 
is to instead add all the contributions from /-j*, and then all the contributions 
from /|, and so forth, adding up all the contributions from a single term at 
once. We could use that a term Cj :& z e of /* that does not contribute towards 
f*+i "will not contribute to for any k > i. This would reduce the number 
of passes we make through the larger portion of the array, though the cache 
performance of potentially writing to s — 1 images at once raises questions. 

When adding contributions to /*, any two terms Cj. e z e and Cj^ e »z e of /* 
whose bits e[n(Z)], e*[n(Z)] agree for j < I < i will both contribute to /* in the 
same fashion (i.e. we will either add or subtract both coefficients c^e and Ci je * 
to an image, or do nothing). Thus we need only inspect the non-zero criterion of 
one exponent e in a block of exponents kn^i < e < (k + l)n.j_i. Similarly, we 
need only inspect one exponent in a contiguous block of ni exponents in order 
to determine their shared value of (— l) e I n W]. 

Given an exponent that does not satisfy the non-zero criterion, our im- 
plementation will generate the next exponent that does satisfy the non-zero 
criterion, by way of bit operations. 

Procedure |BreakIntoImages( a, n) breaks / into the images fi. The pro- 
cedure effectively has three sections. In the first section of the algorithm, we 
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Procedure AddContributions (a, n, i, add?) 



Input 



• a, a length-n array containing /*, . . . , f*_i and 2r^, in that order. 

• add?, a boolean value. 

Result: The contribution of /*,..., f*_ l towards 2/* are added to 2r, 
or subtracted if add? is false. As a result we will have 2/* in 
place of 2rj. 

1 if add? then c < — 1 else c < 1 

2 outOf f set < — Sj=i n j 

3 for j < — 1 to ?' - 1 do 
// Add contribution of f* to /* 



inOf f set < — 5Zfc=i n i 
for e < — to rij — 1 do 



if e[n(l)} = for j < I < i then 

[_ a(outOf f set + (e mod m)) += c(-l) e ["( J )la(inOf f set + e) 



Procedure Breaklntolmages (a, n): An in-place algorithm to compute 
a vector of images of / 

Input: a, a length-n array containing the coefficients of / — J^i^o a i z% ■> 

where n = Yls—i n i-S 
Result: The images fi, 1 < i < s, are written in place of /. 

// Write ri,...,r s in place of / 

1 m < — 

2 for i i — 1 to s do 



for j < — m to 7i — rii — 1 do a(j) < — a(j) — a(j + Uj) 

m < — m + 



II Write the weighted image /* in place of r,; , for 1 < % < s 

5 m < — 

6 for i i — 2 to s do 



7 


for j i — m to m + n, - 


- 1 do a(j) <- 


- 2a(j) 


8 


lAddContributionsfa. n. 


i, true) 




9 


for j < — m to m + rii - 


- 1 do a(j) ^ 


- |a(j) 


10 


m < — m + m 







// Reweigh /* to get /j 
n m < — rii 
12 for i i 2 to s do 



13 
14 



for j <s — m to n — 1 do a(j) < — 2a(j) 

m < — m + rii 
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break / into the remainders rj. Producing entails n* < n% additions, and so 
producing all the Ti entails less than J2i=i n i — n rm S additions. 

In the second section we write the weighted images /* in place of the r\. 
Adding all the contributions, per lemma [5J requires 2n additions. We reweigh 
the last n — n\ coefficients of / by 2, then by \. This constitutes less than n 
such multiplications. 

In the third section we reweigh the weighted images /* to get the images 
fi. This entails less than n multiplications by 2. This gives us the following 
complexity: 

Lemma 9. Procedure |BreakIntoImages| ^a, n) entails no more than 3n addi- 
tions and In multiplications by 2 ±1 . 

2.4 Putting the algorithm together 



Procedure TFT (a, n, w): An in-place algorithm to compute a vector of 
evaluations of / 
Input: 

• a, a length-n array containing the coefficients of / = X^o* a i z% \ where 

• u>, a root of $i. 

Result: TFT^ n) (/) is written in place of /. 

1 |BreakIntoImages[ a, n) 

2 (7, TO) < (w, 0) 

3 for i i 1 to s do 

4 IHalf DFTf a + to. tu . 7) 

5 TO < TO + Tli 

6 if i < s then // Set 7 to root of 3>j+i 

7 |_ for j i — log(n i+ i) to log(n») 1 do 7 « — 7 2 



ITFTf a. n, w) gives a detailed description of the entire TFT algorithm. The 
elements in arrays a, and the elements 7 and lo comprise the ring elements in 
the algorithm. To introduce array notation, given an array a and integer m we 
will let a + m denote the array b where b(i) — a(i + to) for i > 0. 

We note that, in order to make the algorithm truly in-place, we cannot store 
all the values for 1 < i < s. As the values rii are always accessed sequentially, 
we could store the values n, n\, and n s , and have functions which, given n and 
ni, produces rii-i and nj+i respectively. This could be done by way of bit shifts 
and bit masks. 

Theorem 10. ITFTf a. n, a?) entails fewer than 
• ^nlogn + 0(n) ring multiplications, 
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• n log n + 0(n) ring additions, and 



• 2n multiplications by 2 . 

Proof. The complexity of \BreakIntoImages\ was given by lemma [5] The brunt 
of the work is due to the half-DFTs. Counting the number of multiplications 
due to the half-DFTs by way of lemma [3] will give us 

s 

J2 [§«< !og« 4 + 0(r»i)] < ^nlogn + 0(n), (43) 

as log ni < \ogn and J2t=i ni = n - ^ similar analysis gives us nlogn + 0(n) 
ring additions. □ 



2.5 Obtaining a polynomial from its weighted TFT 

We need a means of reversing the algorithm: constructing the coefficients of / 
from its evaluation points. Every step of the forward transform algorithm is 
reversible. We can reobtain / as follows: 

1. Reobtain /j from its respective half-DFT, for 1 < i < s bv lInverseHalf DFTl 
Reweigh each fa to obtain /*. 

2. For i — s down to i = 1, subtract all the contributions from /j to obtain 

n. 

3. Recall that q s -i is exactly r s . For i = s — 1 down to 1, write in 
place of Qi and r^, per the method described in section T2.2I The resulting 
polynomial is go = f ■ 

|PutImagesBackTogether| (a, n) will, given an array a containing the weighted 
images /i, . . . ,/ s , write / to the array a in place of its weighted images. It 
has complexity comparable to BreakIntoImages\ The function UnverseTFT] 
inverts a TFT, with the same stated complexity as ITFTf a. n,ui). 



3 In-place polynomial multiplication 

We present an algorithm for in-place multiplication of polynomials. In this 
algorithm, the input polynomials are preserved, and the only working space 
is an array c allocated for the output product, and space for an additional 
0(1) ring elements and 0(1) integers bounded by n. The algorithm mimics the 
structure of the Harvey-Roche algorithm for in-place polynomial multiplication 
|3] , but instead uses the TFT algorithm presented here. 

Let /, g e TZ[z], and let n = deg(/) + deg(g) + 1, and let s,n i: uJi, and $i, 
1 < i < s be as defined in section El The algorithm will write TFT^, ( ™ ) (fg) 
to the array c, and then obtain h = fg by performing an inverse- TFT on the 
output array. 
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Procedure PutlmagesBackTogether (a, n): An in-place algorithm to 
compute / given its images 

Input: a, a length- 77 array containing the images fx, . . . , f s . 

Result: The polynomial / is written in place of its weighted images. 

// Reweigh the images to get / * , for 1 < i < s 

1 m < — ri\ 

2 for i i — 2 to s do 

3 for j i — m to n — 1 do a(j) < — 

4 m < — m + rii 

II Write Ti in place of / * , for i = s, s — 1, . . . , 1 

5 (i, m) i — (s, n) 

6 while i > do 

7 771 i m — TLi 

8 for j i — m to m + rii — 1 do a(j) < — 2a(j) 

9 lAddContributionsf a. n. i. f alse ) 

10 for j i — 77i to m + rii — 1 do a(j) < — 

// Write / in place of the remainders n 

n (i, m) < — (s, n) 
12 while i > do 
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777 < 777 — Tli 

for j < — 777 to 77 — rii — 1 do a(j) < — a(j) + a(j + rii) 



Procedure InverseTFT(a, 77, lu) : An in-place algorithm to obtain / from 

TFTff">(/) 

Input: 

• a, a length-n array containing TFT^ n ^(/), where n — Y2t=i n i- 

• uj, a root of $i. 

Result: / is written in place of TFT^, (rl) (/). 

1 (7, m) i — (w, 0) 

2 for i i — 1 to s do 
UnverseHalf DFTT a + 777,77^, 7) 

777 i 777 + 77i 

if i < s then // Set 7 to root of 

|_ for j i — log(77 i+ i) to log(ni) 1 do 7 < — 7 

7 |PutImagesBackTogether| [a, 77) 



2 
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Let hi(z) — h(z) mod 1 < i < s, be the images of h = fg. We will 
iteratively write the half-DFTs HDFT™ 1 . (h) into the output array, giving us 
TFT^'™^ (h). So as not to require more space, these half-DFTs will be computed 
in chunks of non-increasing, power-of-two sizes. 

We suppose that HDFT"jj (hj) has already been written in order to the front 
of output array for 1 < j < i. Let c be the array comprising the remaining 
n-i free space in our output array. 

To write HDFT™.(/g) to c, we will let I be the least power of two such 
that we have 21 free array elements remaining. We handle the case where we 
have only one free array element separately. We will write length-^ sections 
of HDFT™* (/) and HDFT™* (g) to unused space in c, and then compute their 
pointwise product in place of the section of HDFT™ ; (/), after which the length-^ 
section of HDFT™ (g) can be discarded and its array space reused. 

If i < s, the sections will be of length I = n.;/4, . . . , and then we 
will compute a last additional length-nj + i section. If i = s, we will compute 
sections of length n s /2,n s /A, . . . , 1, plus an additional length-1 section that is 
computed using auxiliary space for an additional ring element. Computing each 
section of the half-DFT will entail one pass through our inputs / and g. Thus 
in total over the entire multiplication algorithm we will pass through the input 
arrays log(n) + s times. 

Fix < k < log(rii) and define, for 1 < j < k, 

= z n *' 2i - ^T' V and $* = z"'/ 2fc -cp /2fe . (44) 
Then <I>i = z ni + 1 factors as 

For i < s we will choose k = log(ni — n^+i) and if i — s we will let log(n s ). 
Moreover, the roots of <&jj are of the form z = Pj"lj, where pj — — 1 , and 7^ 
is an (rii/2 : ') th root of unity. Thus, to compute the section of HDFT™^/) (and 
similarly for 3), comprising the evaluation of / at the roots of we will first 

compute /' = f(pjz) mod (W 2 ' - 1), and then compute DFT^ i/2J (/'). A root 
of $* will be of the form z — pt+ijk, an d so the evaluation of / and g at the 
roots of <&* can be computed similarly. 

We leave it as an exercise to the reader to check that this will produce 
the evaluation of h at the roots of z ni + 1 in the same order as procedure 
IHalfDFTf c.n,.a;,y 

After we have computed TFT^ n) (h), we then can perform the inverse TFT 
to give us h(z). Algorithm [1] describes the multiplication algorithm. 
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Procedure ComputeHalf DFTof Product (/, g, i, Ui, c) 



Input: ; 

• /, g € H[z\, two polynomials such that deg(/<?) = n — 1, where n can be 
written as a sum of decreasing powers of two n = Yn=i n i- 

• i, an integer such that 1 < i < s. 

• oji, a fixed root of 4>j = z ni + 1. 

• c, a length-n array, where ni, comprising elements of 1Z. 
Result: HDFT".(/g) is written to the array c 



1 (f reeSpace, of f set) < — \ Y^j = i n i^ 
II Evaluate at roots of j and, if i < s, $ 

2 j i 1 

3 while f reeSpace > 1 do 

4 



5 

6 
7 
8 
9 
10 
11 

12 
13 



^ ^ 2 [log(f reeSpace)J — 1 

/ -\ , ( 2 J '-1 2m/l • , I ' 

(P,7,j) < — ,J + \ ) 

Write f(pz) mod z l — 1 to c(rn), . . . , c(of f set + £ — 1) 
Write g(/0z) mod z l — 1 to c(m + Z) . . . , c(of f set + 2Z — 1) 
IFFTT c + offset, I, 7) 
iFFTT c + offset + I, I, 7) 
for u < — to I — 1 do 
|^ c(of f set + u) < — c(of f set + ?i)c(of f set + I + u) 

f reeSpace < — f reeSpace — I 
offset < — offset + / 



// Evaluate at roots of $* if i = s 

14 if f reeSpace == 1 then 

2 k+i_ 1 
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k + l _ 

P < — u>; 

c(n — 1) i — [f(pz) mod z — 1) {g{pz) mod (z — 1)) 



Algorithm 1: In-place polynomial multiplication 



Input: ; 

• f,g € 7Z\z\, two polynomials such that deg(/<?) = n — 1, where n can be 
written as a sum of decreasing powers of two n — 5Zi=i 

• c, an array of size n. 

• to, a root of z" 1 + 1. 

Result: The coefficients of h = fg are written to c 

1 m i — 

2 for i i 1 to s do 
|ComputeHalf DFTof Product^ /, g, i, u ni/ni , c + m) 
m < — m + rii . 

5 UnverseTFlf c. n, w) 
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4 Implementation 



As a proof-of-concept, we implemented the WTFT algorithm, its inverse algo- 
rithm, and the algorithm for in-place multiplication via the WTFT. This imple- 
mentation is a module for the Maple computer algebra system. Our implemen- 
tation was done for polynomials over Z/pZ, where p = 2013265921 = 2 27 -15 + l. 
Our implementation cheats in that we store the values rij, 1 < i < s in a list. 
There may potentially be [logn] such values. This implementation, written in a 



Maple worksheet, is made available at cs .uwaterloo . ca/~a4arnold/code/tf t ,mw 
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